Working with Unlabeled Data – Cluster Analysis

Find the best number of clusters with k-means and agglomerative clustering

Overview

  1. Load the data file
    • check the shape and plot the content
  2. Observe the pair plot and comment on the shapes in view of clustering
    1. if necessary, transform the data
  3. Use the elbow method to find the optimal number of clusters. To do this, test KMeans with the number of clusters varying from 2 to 10; for each value of k
    • fit the data
    • compute the inertia and the silhouette score
    • store them for plot
  4. Plot inertia and silhouette score versus k
  5. Choose the optimal number of clusters by looking at the plots
  6. Cluster the data using the optimal number, plot the cluster assignment
    • in the plot choose the features that seem to be most promising
  7. For comparison, repeat the same operations with AgglomerativeClustering and DBSCAN
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from mpl_toolkits import mplot3d
from sklearn.model_selection import ParameterGrid
import warnings
warnings.filterwarnings("ignore")

random_state = 42 # This variable will be used in every procedure call that accepts a random_state parameter,
               # so that the run can be perfectly reproduced;
               # just change this value for a different experiment

1. Load the data file

Check the shape and plot the content

In [2]:
 
Out[2]:
(440, 8)
In [3]:
 
Out[3]:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185
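The loading step might be sketched as below. The notebook does not show the file name, so the `read_csv` call is hypothetical; a two-row inline sample with the same columns (which match the UCI "Wholesale customers" dataset, 440 rows by 8 columns) stands in for the real file:

```python
from io import StringIO
import pandas as pd

# Inline sample standing in for the real file; in the notebook this would be
# something like df = pd.read_csv("wholesale.csv") (file name is hypothetical)
csv_text = """Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
"""
df = pd.read_csv(StringIO(csv_text))
print(df.shape)   # (2, 8) for the sample; (440, 8) for the full file
df.head()
```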

2. Observe the data distributions

In [4]:
 
In [5]:
 

We observe that the distributions of values are definitely skewed: in the columns from Fresh to Delicassen most values are concentrated at the low end of the range, but there are always outliers, frequently spread over a very large range.

Clustering is more effective in the absence of outliers and with all the variables distributed over similar ranges; for this reason, we will apply two transformations:

  1. transform all the variables from the column Fresh to the column Delicassen with PowerTransformer
  2. rescale all the variables to the range [0, 1] with MinMaxScaler
In [6]:
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
Out[6]:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 1.0 1.0 0.501828 0.667606 0.625238 0.208640 0.649941 0.412016
1 1.0 1.0 0.414266 0.670028 0.655690 0.458800 0.674852 0.445663
2 1.0 1.0 0.400077 0.653586 0.627297 0.499856 0.682752 0.654292
3 0.0 1.0 0.509368 0.369264 0.553550 0.636716 0.461095 0.446485
4 1.0 1.0 0.604755 0.580657 0.618985 0.566470 0.601884 0.590255
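The two transformations might be sketched as follows; synthetic skewed data stand in for the wholesale columns, and the column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, MinMaxScaler

rng = np.random.default_rng(42)
# Synthetic stand-in: two categorical-like columns plus two skewed spending columns
df = pd.DataFrame({
    "Channel": rng.integers(1, 3, 100),
    "Region": rng.integers(1, 4, 100),
    "Fresh": rng.lognormal(9, 1, 100),
    "Milk": rng.lognormal(8, 1, 100),
})

cols = ["Fresh", "Milk"]   # in the notebook: all columns from Fresh to Delicassen
df_t = df.copy()
df_t[cols] = PowerTransformer().fit_transform(df_t[cols])   # reduce skew and outlier effect
# rescale every column to [0, 1]
df_t = pd.DataFrame(MinMaxScaler().fit_transform(df_t), columns=df_t.columns)
```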

Show the result of the transformation

In [7]:
 
In [8]:
 

Now the effect of the outliers is reduced, and we can proceed with the clustering.

3. Use the elbow method to find the optimal number of clusters

Test KMeans with varying number of clusters, from 2 to 10

Prepare the results list that will contain a pair (inertia, silhouette score) for each value of k; then, for each value of k:

  • initialize an estimator for KMeans
  • fit the data and predict the cluster assignment for each sample with fit_predict (or fit followed by predict)
  • the inertia is provided in the attribute inertia_ of the fitted model
  • compute the silhouette score using the function silhouette_score from sklearn.metrics, passing the data and the fitted labels as arguments; these values will fill the variable silhouette_scores
  • store the two values above in the list created at the beginning
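The loop described above might be sketched as below; synthetic blobs stand in for the transformed wholesale data, which is not reproduced here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = []                       # one (k, inertia, silhouette) entry per k
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)     # fit the data and get the cluster assignment
    results.append((k, km.inertia_, silhouette_score(X, labels)))
```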
In [9]:
 

4. Plot inertia and silhouette score versus k
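A minimal plotting sketch; the placeholder lists below stand in for the inertia and silhouette values stored in the previous step:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt

ks = list(range(2, 11))
inertias = [100.0 / k for k in ks]            # placeholder values: use the stored results
silhouettes = [0.40 + 0.01 * k for k in ks]   # placeholder values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, inertias, "o-")
ax1.set_xlabel("k"); ax1.set_ylabel("inertia")
ax2.plot(ks, silhouettes, "o-")
ax2.set_xlabel("k"); ax2.set_ylabel("silhouette score")
fig.tight_layout()
```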

In [10]:
 
In [11]:
 

5. Cluster with the optimal number

The two elbow points of the inertia curve suggest 3 or 4 clusters, with the elbow slightly more pronounced at 3. The silhouette score has a maximum at 4, but the increase with respect to 3 is very small.

We will choose k=4

In [12]:
 
Number of clusters = 4	- Distortion =  54.49	- Silhouette score = 0.49
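The final fit might look as follows (a sketch on synthetic blobs; the distortion and silhouette values will differ on the real data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # stand-in data
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(f"Number of clusters = 4\t- Distortion = {km.inertia_:6.2f}"
      f"\t- Silhouette score = {silhouette_score(X, labels):.2f}")
```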

Show the distribution of samples in the clusters with a pie chart
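A pie-chart sketch; the hypothetical label array below stands in for the fitted cluster assignment:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])   # hypothetical cluster assignment
sizes = np.bincount(labels)                       # number of samples per cluster
plt.pie(sizes, labels=[f"cluster {i}" for i in range(len(sizes))], autopct="%1.1f%%")
plt.title("Cluster sizes")
```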

In [13]:
 

Comments

The silhouette score ranges from -1 (worst) to 1 (best); as a rule of thumb, a value greater than 0.5 should be considered acceptable.

Agglomerative clustering

We will try a grid of parameter configurations, with the number of clusters in the range 2 to 10 and the four linkage methods available in the sklearn implementation of AgglomerativeClustering.

In [14]:
from sklearn.cluster import AgglomerativeClustering
In [15]:
 
Out[15]:
linkage n_clusters silhouette_score
0 ward 2 0.499350
9 complete 2 0.499350
27 single 2 0.499350
18 average 2 0.499350
1 ward 3 0.498333
2 ward 4 0.493085
20 average 4 0.479507
11 complete 4 0.479507
21 average 5 0.475050
22 average 6 0.443867
23 average 7 0.429079
24 average 8 0.417343
33 single 8 0.402861
10 complete 3 0.402228
19 average 3 0.402228
13 complete 6 0.358410
12 complete 5 0.350571
34 single 9 0.349389
29 single 4 0.343557
28 single 3 0.343530
25 average 9 0.343345
3 ward 5 0.342252
26 average 10 0.338956
4 ward 6 0.333632
35 single 10 0.332562
5 ward 7 0.318470
30 single 5 0.292567
6 ward 8 0.285732
14 complete 7 0.280073
32 single 7 0.257306
7 ward 9 0.256243
8 ward 10 0.242371
15 complete 8 0.224454
31 single 6 0.212586
16 complete 9 0.185725
17 complete 10 0.166637
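The grid search above might be sketched as below, again on synthetic stand-in data:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)  # stand-in data

grid = ParameterGrid({
    "linkage": ["ward", "complete", "average", "single"],
    "n_clusters": list(range(2, 11)),
})
rows = []
for params in grid:
    labels = AgglomerativeClustering(**params).fit_predict(X)
    rows.append({**params, "silhouette_score": silhouette_score(X, labels)})
agg_out = pd.DataFrame(rows).sort_values("silhouette_score", ascending=False)
```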

The top five results have very similar silhouette scores. We will choose the setting with 4 clusters, as for k-means, and the linkage giving the best result with 4 clusters, that is ward. This is the result record with index 2 (the record index is the unnamed column at the far left of the dataframe output).

In [16]:
 
  linkage  n_clusters  silhouette_score
2    ward           4          0.493085
In [17]:
 

Show the distribution of data in the clusters

In [18]:
 
In [19]:
 
In [20]:
 

Comments

The solution with Agglomerative Clustering in this case provides a result very similar to that of k-means.

It is interesting to compare more deeply the results of the two clustering models.

The function pair_confusion_matrix computes the number of pairs of objects that are in the same clusters or in different clusters in two different clustering schemes.

The result is given as a 2x2 matrix: the smaller the numbers off the main diagonal, the better the match.

For easier readability, we divide every element of the matrix by the sum of all the elements; in this way, the matrix entries sum to 1.

In [21]:
from sklearn.metrics import pair_confusion_matrix
Out[21]:
array([[0.66625595, 0.00218472],
       [0.00089045, 0.33066888]])

A compact indicator of the match percentage can be obtained as the sum of the elements of the main diagonal.

In [22]:
 
The percentage of match between the two clustering schemes is  99.69%
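The normalization and the match indicator can be sketched on two small hypothetical label vectors:

```python
import numpy as np
from sklearn.metrics import pair_confusion_matrix

# Two hypothetical clusterings of the same six points
a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 1, 1, 1, 2]

M = pair_confusion_matrix(a, b).astype(float)
M /= M.sum()                     # normalize: the entries now sum to 1
match = 100 * np.trace(M)        # percentage of pairs on which the schemes agree
print(f"The percentage of match between the two clustering schemes is {match:6.2f}%")
```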

DBSCAN

In [23]:
from sklearn.cluster import DBSCAN
In [24]:
 
Default parameters - Eps=0.5 - Min_Samples=5
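A sketch of the default-parameter run, on synthetic data rescaled to [0, 1] as in the notebook (DBSCAN's eps is scale-sensitive):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # stand-in data
X = MinMaxScaler().fit_transform(X)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)        # default parameters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)    # label -1 marks noise
unclustered = np.mean(labels == -1) * 100
print(f"Default parameters - Eps=0.5 - Min_Samples=5 -> "
      f"{n_clusters} clusters, {unclustered:.1f}% noise")
```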
In [25]:
 

Show the distribution of data in the clusters

In [26]:
 

5. Find the best parameters using ParameterGrid

Prepare the dictionary of the parameter ranges and produce the list of parameter settings to test with the function ParameterGrid.
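A minimal example of the mechanics, with hypothetical ranges:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical parameter ranges, just to show how ParameterGrid works
param_grid = {"eps": [0.05, 0.10], "min_samples": [2, 3]}
settings = list(ParameterGrid(param_grid))
print(len(settings))   # 4: the Cartesian product of the two ranges
```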

In [27]:
 

Arrange DBSCAN results in a dataframe, for easier presentation and filtering

dbscan_out = pd.DataFrame(columns = ['eps','min_samples','n_clusters','silhouette', 'unclust%'])

  • for each parameter setting: fit DBSCAN, count the resulting clusters, compute the silhouette score and the percentage of unclustered (noise) points, then append a row to dbscan_out
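The loop might be sketched as below; the data and the eps values are illustrative stand-ins, and noise points (label -1) are excluded from the silhouette computation:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import MinMaxScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # stand-in data
X = MinMaxScaler().fit_transform(X)

grid = ParameterGrid({"eps": [0.05, 0.10, 0.15, 0.20],        # illustrative ranges
                      "min_samples": [2, 3, 4, 5]})
rows = []
for params in grid:
    labels = DBSCAN(**params).fit_predict(X)
    core = labels != -1                                       # mask out the noise
    n_clusters = len(set(labels[core]))
    sil = (silhouette_score(X[core], labels[core])
           if 1 < n_clusters < core.sum() else np.nan)        # silhouette needs >= 2 clusters
    rows.append({**params, "n_clusters": n_clusters, "silhouette": sil,
                 "unclust%": 100 * (~core).mean()})
dbscan_out = pd.DataFrame(rows)
```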
In [28]:
 
In [29]:
 
Out[29]:
eps min_samples n_clusters silhouette unclust%
652 0.506 2.0 4.0 0.529010 0.909091
653 0.506 3.0 4.0 0.529010 0.909091
654 0.506 4.0 4.0 0.529010 0.909091
655 0.506 5.0 4.0 0.529010 0.909091
656 0.506 6.0 4.0 0.529010 0.909091
... ... ... ... ... ...
743 0.561 5.0 2.0 0.500188 0.227273
744 0.561 6.0 2.0 0.500188 0.227273
745 0.561 7.0 2.0 0.500188 0.227273
746 0.561 8.0 2.0 0.500188 0.227273
747 0.561 9.0 2.0 0.500188 0.227273

96 rows × 5 columns

In [30]:
 
Out[30]:
eps min_samples n_clusters silhouette unclust%
0 0.051 2.0 4.0 0.809127 98.181818
55 0.131 5.0 4.0 0.577401 75.681818
114 0.166 8.0 4.0 0.564877 59.090909
123 0.171 9.0 4.0 0.556483 57.500000
122 0.171 8.0 4.0 0.554163 55.454545
131 0.176 9.0 4.0 0.552283 54.772727
130 0.176 8.0 4.0 0.551633 53.181818
139 0.181 9.0 4.0 0.549653 53.863636
138 0.181 8.0 4.0 0.549119 51.590909
120 0.171 6.0 4.0 0.546914 49.545455
111 0.166 5.0 4.0 0.545816 50.681818
128 0.176 6.0 4.0 0.544325 46.363636
147 0.186 9.0 4.0 0.538980 50.000000
119 0.171 5.0 4.0 0.537250 46.590909
652 0.506 2.0 4.0 0.529010 0.909091
653 0.506 3.0 4.0 0.529010 0.909091
654 0.506 4.0 4.0 0.529010 0.909091
655 0.506 5.0 4.0 0.529010 0.909091
656 0.506 6.0 4.0 0.529010 0.909091
657 0.506 7.0 4.0 0.529010 0.909091
658 0.506 8.0 4.0 0.529010 0.909091
659 0.506 9.0 4.0 0.529010 0.909091
162 0.196 8.0 4.0 0.516842 37.727273
163 0.196 9.0 4.0 0.514844 41.136364
171 0.201 9.0 4.0 0.513659 39.318182
179 0.206 9.0 4.0 0.511789 36.363636
161 0.196 7.0 4.0 0.510886 35.909091
170 0.201 8.0 4.0 0.509058 35.909091
187 0.211 9.0 4.0 0.503088 34.090909
169 0.201 7.0 4.0 0.502949 34.545455
178 0.206 8.0 4.0 0.499673 33.181818
186 0.211 8.0 4.0 0.498606 32.954545
195 0.216 9.0 4.0 0.490289 32.045455
203 0.221 9.0 4.0 0.485615 29.772727
194 0.216 8.0 4.0 0.485379 30.454545
211 0.226 9.0 4.0 0.483599 28.181818
202 0.221 8.0 4.0 0.483369 28.409091
22 0.106 4.0 4.0 0.478732 91.590909
28 0.111 5.0 4.0 0.455728 92.045455
64 0.136 6.0 4.0 0.450831 77.045455
40 0.121 6.0 4.0 0.412454 90.000000
19 0.101 5.0 4.0 0.400643 95.454545
72 0.141 6.0 4.0 0.397130 71.818182
98 0.156 8.0 4.0 0.382141 69.318182
129 0.176 7.0 4.0 0.354982 49.318182
106 0.161 8.0 4.0 0.353064 67.045455
39 0.121 5.0 4.0 0.282229 87.045455

6. Observe

  • Visually inspect the most promising combination of parameters among those reproducing the same number of clusters as the previous experiments
  • fit and predict the cluster labels and show a pairplot
In [31]:
 
In [32]:
 

Show the cluster sizes

In [33]:
 

Silhouette plots

Use the function plot_silhouette contained in the module plot_silhouette_w_mean (provided with the notebook), and from sklearn.metrics import the function silhouette_samples, which provides the silhouette score for each sample.

from plot_silhouette_w_mean import plot_silhouette # python script provided separately

Hint: use help(plot_silhouette) for the meaning of the parameters

Hint: for DBSCAN you should exclude the rows of noise
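The noise-exclusion step for DBSCAN might be sketched as below (the signature of plot_silhouette is not shown here, so only the masking and per-sample scores are illustrated, on stand-in data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import MinMaxScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # stand-in data
X = MinMaxScaler().fit_transform(X)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

mask = labels != -1              # exclude the rows of noise (label -1)
sil = None
if len(set(labels[mask])) > 1:   # per-sample silhouette needs at least two clusters
    sil = silhouette_samples(X[mask], labels[mask])
# then: plot_silhouette(...) on the masked data; see help(plot_silhouette)
```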

In [34]:
from plot_silhouette_w_mean import plot_silhouette  # python script provided separately

Optional - Comparison with boxplots

For each of the clustering schemes show how the attribute values are distributed in the clusters
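A boxplot sketch for one clustering scheme; the feature names and the data are illustrative stand-ins:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend; not needed inside the notebook
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=42)
df = pd.DataFrame(X, columns=["Fresh", "Milk", "Grocery", "Frozen"])  # illustrative names
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# One boxplot panel per attribute, grouped by cluster label
df.boxplot(column=["Fresh", "Milk", "Grocery", "Frozen"], by="cluster",
           figsize=(10, 6), layout=(2, 2))
```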

In [35]:
 
In [36]:
 
In [37]:
 

Control questions

  1. Repeat the experiments without the data transformations and comment on the results
  2. Repeat the final fittings with the numbers of clusters immediately before and after the chosen values and comment on the results